Evaluating tubulointerstitial compartments in renal biopsy specimens using a deep learning-based approach for classifying normal and abnormal tubules

Renal pathology is essential for diagnosing and assessing the severity and prognosis of kidney diseases. Deep learning-based approaches have developed rapidly and have been applied in renal pathology. However, methods for the automated classification of normal and abnormal renal tubules remain scarce. Using a deep learning-based method, we aimed to classify normal and abnormal renal tubules, thereby assisting renal pathologists in the evaluation of renal biopsy specimens. Consequently, we developed a U-Net-based segmentation model using randomly selected regions obtained from 21 renal biopsy specimens. Further, we verified its performance in multiclass segmentation by calculating the Dice coefficients (DCs). We used 15 cases of tubulointerstitial nephritis to assess its applicability in aiding routine diagnoses conducted by renal pathologists and calculated the agreement ratio between diagnoses conducted by two renal pathologists and the time taken for evaluation. We also determined whether such diagnoses were improved when the output of segmentation was considered. The glomeruli and interstitium had the highest DCs, whereas the normal and abnormal renal tubules had intermediate DCs. Following the detailed evaluation of the tubulointerstitial compartments, the proximal, distal, atrophied, and degenerated tubules had intermediate DCs, whereas the arteries and inflamed tubules had low DCs. The annotation and output areas involving normal and abnormal tubules were strongly correlated in each class. The pathological concordance for the glomerular count, t, ct, and ci scores of the Banff classification of renal allograft pathology remained high with or without the segmented images. However, in terms of time consumption, the quantitative assessment of tubulitis, tubular atrophy, degenerated tubules, and the interstitium was improved significantly when renal pathologists considered the segmentation output. Deep learning algorithms can assist renal pathologists in the classification of normal and abnormal tubules in renal biopsy specimens, thereby facilitating the enhancement of renal pathology and ensuring appropriate clinical decisions.


Introduction
Renal diseases are a significant global burden in all facets of health and economy [1,2]. Therefore, the precise diagnosis of kidney diseases is a prerequisite for selecting an appropriate treatment strategy. As the golden standard for diagnosing kidney diseases, renal pathology is essential. Information obtained from renal biopsy specimens is used to confirm the diagnosis and further assess the severity and prognosis of kidney disease. Therefore, to ensure appropriate clinical decisions, the accurate assessment of renal biopsy specimens is essential.
In this study, we aimed to classify normal and abnormal renal tubules precisely by developing a segmentation model using U-Net [38], which is a representative CNN-based architecture mainly used for the segmentation of biomedical images. We improved U-Net by implementing fine finetuning and Dice cross-entropy [39,40]. We annotated the abnormal tubules in detail, including the atrophic and degenerated tubules as well as tubulitis. The automated classification of renal tubules could help renal pathologists evaluate renal biopsy specimens rapidly and accurately.

Renal biopsy specimens
We used formalin-fixed, paraffin-embedded needle-core biopsies obtained from 21 patients (7 patients 1 h after renal transplantation and 14 patients with tubulointerstitial nephritis) who underwent renal biopsy between 2000 and 2020 at Kanazawa University Hospital and its affiliated hospitals. Because various kidney diseases can involve glomeruli in addition to tubulointerstitial compartments, we needed to collect homogenous samples that involved only the  [18] (Continued ) tubulointerstitial compartments for annotation. Thus, specimens with tubulointerstitial nephritis without other involvement were used to annotate abnormal tubulointerstitial structures, whereas specimens collected 1 h after renal transplantation were nearly healthy controls to annotate normal kidney structures. In each specimen, a 2 μm section was stained using a periodic-acid Schiff staining reagent. This study was approved by the Ethical Committee of Kanazawa University (approval No. 2020-178). The ethics committee waived the requirement for obtaining informed consent from the participants because our study design is retrospective and does not involve any further tests or treatments of the participants. In addition, all data were fully anonymized before we accessed them. Further, all participants had access to the detailed information about the study, including the purpose, subjects, and content, available on our website. All subjects were allowed to withdraw from the study participation using a written form whenever they wanted. All these processes were approved by the Ethical Committee of Kanazawa University.
The annotations were carried out by a nephrologist with sufficient experience in renal pathology (S.H.). Because the number of renal pathologists is still quite small in Japan, nephrologists are trained and practice renal pathology in most facilities. The annotations performed by S.H. were double-checked by another nephrologist with sufficient renal pathology experience (M.K.) to improve the annotation quality. When the two nephrologists had different opinions, they discussed the issue and then annotated after reaching concordance.
All the normal or abnormal glomeruli were labeled as "glomeruli." Thin ascending limbs of Henle, convoluted distal tubules, and cortical collecting ducts were labeled as "distal tubules." The "arteries" included archery arteries, interlobular arteries, and arterioles. Tubules with infiltration of inflammatory cells and without atrophy or degeneration were defined as "tubulitis." The "atrophic tubules" showed narrowing of the tubular lumen owing to atrophy or the wrinkling of the tubular basement membranes, regardless of inflammatory infiltration, without tubular degeneration. The "degenerated tubules" were defined as tubular abnormalities, such as tubular vacuolation, tubular simplification, budding, loss of brush border, and cell detachment, excluding tubular atrophy and tubulitis. All other unlabeled structures were included in the "interstitium" category.
First, the kidney biopsy images were annotated with eight classes as described. Then, the eight classes were recategorized into five classes. "Proximal tubules" and "distal tubules" were recategorized into "normal tubules," whereas "atrophic tubules," "tubulitis," and "degenerated tubules" were recategorized into "abnormal tubules." The total numbers in the annotated training and test sets are listed in Table 2.

CNN design
We used U-Net for semantic segmentation. U-Net is a model that applies a CNN [38]. Finetuning was implemented using the VGG-16 model [41], which was pretrained on the Ima-geNet dataset, as the U-Net encoder. The model inputs were the image and annotation data, and the output was the label information for each pixel. We compared the segmentation models FCN, U-Net, PSP-Net, and DeepLab v3 in a preliminary study, and chose U-Net as the most suitable for the present study because it exhibited the highest accuracy and relatively clear segmented images (S1 Table and S1 Fig).
To train the model, we used 80% of the prepared images, which were randomly selected, and the remaining 20% were used to evaluate the model's performance. One image was only used for the training or the test set. The input images for the model were resized to 512×512 pixels. In addition, we standardized the color appearance by the setting of mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225) as compared to RGB. Data augmentation was performed during the training process to improve the model's generalization performance, even with a limited amount of data. We adjusted contrast and flipped horizontally at a rate of 50% and rotated in a range of -15˚to +15˚for each epoch within random ranges. For contrast adjustment, we calculated the average gray color of the input image in grayscale, and then we created an image "a" of that single gray color. Next, we overlaid the input image and image "a," where the alpha value was a numerical value between 0.5 and 1.5. The alpha value signifies the transparency, and the formula for the output image is given as follows: output = image "a" × (1.0-alpha) + input image × alpha. A value of zero signifies a solid gray image, whereas a value of one signifies that the input image remains the same. All these processes were performed using Python functions. The number of epochs was set to 200. Adam was used as the learning rate optimization algorithm, and Dice cross-entropy was used as the loss function. The output of U-Net was the probability of each label per pixel, and the label with the highest probability was assigned as the predicted label for that pixel.

Assessment of U-Net's performance
The Dice coefficient (DC), score of the similarity between two sets, was used to evaluate the segmentation accuracy. The DC for two sets A and B, which ranges from 0 to 1, is defined as follows: Because the ground truth (A) and the segmentation result (B) are similar, i.e., the model's performance is higher, the DC value becomes larger and closer to one. We calculated the DC for each label. Cross-validation was performed 20 times, and the median DC value was calculated.

Agreement rate and time comparison between renal pathologists referring to and not referring to U-Net-segmented images
To evaluate the usefulness of our U-Net algorithm, we examined the agreement ratio between two nephrologists with sufficient experience in renal pathology (R.N. and T.Z.), with and without U-Net-segmented images. For this evaluation, we selected another 15 specimens of tubulointerstitial nephritis obtained through renal biopsies between 2000 and 2020 at Kanazawa University Hospital and its affiliated hospitals. We needed to collect homogenous samples that involved only the tubulointerstitial compartments for validation. Thus, patients with tubulointerstitial nephritis without other involvement were used to estimate abnormal tubulointerstitial structures.
In each sample, a 2 μm section was stained using periodic-acid Schiff staining reagent, and we created whole-slide images for U-Net segmentation. Each renal pathologist evaluated all the biopsy specimens twice. The first assessment was performed without the reference of U-Net-segmented images (U-Net-group), and the other assessment was performed with the reference of U-Net-segmented images (U-Net+ group). There was a washout period of at least two weeks between the U-Net-and U-Net+ groups to avoid habituation effects on the samples. The order of evaluation was crossed: U-Net-!U-Net+ group in nine cases and U-Net +!U-Net-group in six cases. In each review, renal pathologists examined the (1) glomerular count, the (2) Banff t, ct, and ci scores [42], and the (3) percentage of tubulitis, tubular atrophy, degenerated tubules, and interstitial spaces. Each pathologist recorded the total time taken.

Statistical analysis
Interclass correlation coefficient (ICC) values (2.1) were calculated for the agreement ratio of continuous variables among the renal pathologists. Cohen's κ was calculated for the agreement ratio of categorical variables among the renal pathologists. Non-parametric parameters of the two groups were compared using the Mann-Whitney U test. The areas of output were compared with those of annotations using linear regression analysis, and the coefficients of determination were calculated. The significance level for all the analyses was set at 0.05.

Segmentation performance of U-Net for detecting abnormal tubules
First, we performed the semantic segmentation of five classes (glomeruli, normal tubules, abnormal tubules, arteries, and the interstitium) to clarify whether our U-Net can distinguish between normal and abnormal tubules. Representative examples of the ground truth and segmentation masks used in the test set are shown in Fig 1. The multiclass segmentation performance of U-Net was evaluated and calculated using the DCs listed in Table 3. The highest DCs obtained were for the interstitium and glomeruli. Normal and abnormal tubules had middle DCs. A low DC was observed in the arteries. A confusion matrix shows the way in which one class could be misidentified as a different class (Table 4). Normal tubules were often misidentified as the interstitium but not as abnormal tubules, whereas abnormal tubules were often misidentified as normal tubules (19%) or the interstitium (17%). Arteries were mostly misidentified as the interstitium (64%).

Detection of different types of abnormal tubules using U-Net
Next, we performed the semantic segmentation of eight classes (glomeruli, proximal tubules, distal tubules, atrophied tubules, tubulitis, degenerated tubules, arteries, and the interstitium) to verify whether our U-Net can be used to detect different types of abnormal tubules in detail. Representative examples of the ground truth and segmentation masks used in the test set are shown in Fig 2. The multiclass segmentation performance of the U-Net was evaluated using the DCs listed in Table 3. The highest DCs were obtained from the interstitium and glomeruli as well as from the five classes of semantic segmentation. Proximal tubules, distal tubules, atrophied tubules, and degenerated tubules had intermediate DCs. Arteries and tubulitis had low DCs. In the confusion matrix, proximal tubules were misidentified as the interstitium (13%) or as degenerated tubules (11%) ( Table 5). Distal tubules were misidentified as the interstitium (14%). Arteries were mostly misidentified as the interstitium (60%). Tubulitis was misidentified as the interstitium (21%), distal tubules (15%), or degenerated tubules (15%). Degenerated tubules were misidentified as proximal tubules (17%) or the interstitium (16%). Atrophied tubules were misidentified as the interstitium (17%) or as degenerated tubules (10%).
We also quantified the areas of each class using U-Net to determine whether the algorithm could precisely estimate the area of normal and abnormal tubulointerstitial lesions (Fig 3), which directly resulted in a reasonable prediction of renal prognosis. We found a strong correlation between annotations and the segmentation model predictions in the glomeruli, proximal tubules, distal tubules, and the interstitium. Various abnormal tubules, such as tubulitis, degenerated tubules, atrophied tubules, and arteries, were also moderately correlated between annotations and segmentation model predictions.

Application of U-Net-segmented images to diagnostic situations by renal pathologists
Finally, we evaluated the usefulness of U-Net-segmented images as an aid for routine diagnostic work performed by renal pathologists. We investigated whether referring to five classes of U-Net-segmented images would improve the agreement ratios between two renal pathologists when evaluating tubulointerstitial findings in renal biopsy specimens and the time required for evaluation.
The ICCs for the glomerular count were 0.97 and 0.95 for the U-Net-and U-Net+ groups, respectively ( Table 6). The Cohen's κ values of the Banff t, ct, and ci scores were similar at high levels in both groups, ranging from 0.91 to 0.92 in the U-Net-group and 0.81 to 0.94 in the U-Net+ group. The ICCs for the quantitative evaluation of areas in tubulitis, tubular atrophy, degenerated tubules, and the interstitium were low in the U-Net-group (0.14-0.59). However, in the U-Net+ group, the ICCs improved significantly (0.52-0.81), except for degenerated tubules (0.17). Furthermore, referring to the U-Net-segmented images improved the median time for evaluation from 317 s to 214 s [214 s {interquartile range1 (IQR1)180, IQR3 280} in the U-Net+ group vs. 317 s (IQR1 260, IQR3 371) in the U-Net-group; p = 0.044].

Discussion
In this study, we developed a U-Net-based segmentation model to classify the multisystem compartments of renal biopsy specimens primarily related to normal and abnormal tubules. Our developed U-Net could classify normal and abnormal tubules with high accuracy. However, it was still challenging to identify the exact type of abnormal tubules. On the other hand, The ground truth labels are given vertically, and the segmentation model's predictions are given horizontally.
https://doi.org/10.1371/journal.pone.0271161.t004 our U-Net was suitable for the quantitative evaluation of the area in each class and was helpful as an aid for renal pathologists in evaluating tubulointerstitial lesions among renal biopsy specimens.
In this study, we annotated the most significant number of tubular components to discriminate the types of abnormal tubules by adopting U-Net, which is used for the semantic segmentation of kidney histology [4,17,24,31,32]. Hermsen et al. achieved multiclass segmentation through U-Net, which showed high DCs on multiclass structures, using whole-slide images obtained from multicenter institutions [4]. Normal tubules were detected highly, but the DCs of both atrophic and undefined tubules were low (0.49 and 0.30, respectively) [4]. In this study, we prepared the most significant amount of annotated data for different types of normal and abnormal tubules, and the detection rate of atrophic tubules was improved. Degenerated tubules were moderately detected, but the model's performance in detecting tubulitis was low. This may be as a result of the diversity of abnormal tubular findings and the fact that different types of abnormal tubular findings often coincide within the same tubules.
The second notable point of the present study is that we improved U-Net by implementing finetuning and Dice cross-entropy. For finetuning, we used the VGG-16 model [41], which was pretrained on the ImageNet dataset, as the U-Net encoder. The introduction of finetuning did not change the accuracy but shortened the learning time taken. It needed about 150 epochs without finetuning to maintain high accuracy, whereas approximately 90 epochs were needed with finetuning. In addition, we adapted Dice cross-entropy as a loss function. Dice crossentropy is a combination of Dice loss and cross-entropy [39,40]. Dice cross-entropy improved accuracy more than other loss functions such as focal loss and cross-entropy in our preliminary study. We believe that the use of Dice cross-entropy in renal pathological studies is lacking. Recently, studies have been conducted to detect tubulointerstitial abnormalities using various methodologies. Ginley et al. developed a DeepLab v2-based algorithm to assess interstitial fibrosis and tubular atrophy (IFTA) and glomerulosclerosis in native and transplanted kidneys [10]. They achieved the automated detection and quantification of IFTA lesions by setting IFTA collectively without considering each compartment of IFTA. Bouteldja et al. conducted the multiclass segmentation of healthy and five murine disease models using U-Net [31]. They extracted tubular dilation and atrophy by measuring the tubular diameter. Yi et al. constructed a deep learning-based model through the combination of a mask region-based CNN and U-Net algorithms to recognize normal and abnormal tissue compartments in transplant kidneys, including the Banff t, ci, and ct scores [24]. They applied their algorithms to the prediction of graft survival. Furthermore, Salvi et al. employed two different U-Nets, denoted TSC and TCC, and obtained excellent performance in tubular segmentation (DC = 0.92) [11].  Essentially, although it is still challenging to determine the types of abnormal tubules using U-Net alone, in addition to increasing the validity and the number of annotations, the improvement of deep learning-based methods and their combination with clinical information would be required to improve accuracy in the detection of different types of abnormal tubules and enhancing its significance in clinical outcomes.
Another noteworthy aspect of this study is that referring to the U-Net-segmented images can help renal pathologists in evaluating tubulointerstitial lesions accurately and rapidly. The five-class segmented images were visually easier to understand and more accurate than those of the eight-class segmentation. Therefore, the five-class segmentation images were used to assist renal pathologists in evaluating renal biopsy specimens. The glomerular count and tubulointerstitial compartments of Banff scoring showed the highest agreement with and without U-Net-segmented images. However, interestingly, in the quantitative evaluation of tubular abnormalities, which are more difficult for renal pathologists to assess, U-Net significantly improved the interpathologist agreement ratios, except for degenerated tubules. This may be as a result of the high correlation between the U-Net-segmented and annotated regions in each class. Because abnormal tubulointerstitial areas are associated with worsening renal prognoses in various kidney diseases [26,[33][34][35][36][37], the accurate assessment and quantification of odd tubular areas would improve the quality of the prediction of renal prognosis. Furthermore, the improvement in the time required for evaluation by referring to the segmented images using U-Net is another advantage of U-Net in the reduction of the physical burden on renal pathologists [10]. This includes the development of an application for automated detection and quantification, which would help renal pathologists estimate renal prognosis promptly. In addition, the link between U-Net-based segmentation and clinical information would be useful to predict renal prognosis more precisely. This would notably improve the estimation of renal prognosis compared with the current method of semi-quantification of tubulointerstitial compartments in both native kidney specimens [43] and the Banff-grading system of kidney allografts [42].
This study has several limitations. First, our developed U-Net did not recognize tubules as single structures, and different normal and abnormal tubules were mixed within a single tubule, thereby resulting in lower DCs. Second, a relatively small number of renal pathologists participated in this study to validate the usefulness of referring to U-Net-segmented images. Finally, our developed U-Net had a significantly low accuracy for the "arteries" class. The number of annotated arteries was small. Specifically, the number of annotated arteries was 256 of 311 regions taken and 80% of them were used for training and the remaining 20% for testing. This is insufficient for U-Net to train for detecting arteries in the test set. In addition, the size of the arteries was extremely small compared with other compartments. The areas of "arteries" are approximately one-fortieth of those of "interstitium." Thus, "arteries" tended to be misrecognized as "interstitium." This study focused on tubulointerstitial structures, and further examination is required to scan the entire renal biopsy specimens, including the arteries.
In conclusion, our deep learning algorithm assisted renal pathologists in detecting and quantifying different types of normal and abnormal tubules in renal biopsy specimens. However, because the current algorithm is still insufficient for the automated detection and classification of different types of abnormal tubules, we must improve its predictive accuracy. Nevertheless, our current algorithm can be expected to help renal pathologists evaluate renal biopsy specimens accurately and rapidly, thereby contributing to highly appropriate clinical decisions.
Supporting information S1 Fig. Representative images of ground truth and eight-class segmentation using various deep learning methods. PAS-stained slide, ground truth, and segmentation using U-Net. The top row represents a normal specimen, and the second through fourth rows represent specimens with tubulointerstitial nephritis. (TIF) S1